Hi, welcome to this fourth lesson, on advanced data exploration.
In the previous lessons, you have explored you first dataset with aggregate data at the geographical region level, and created different types of charts that you combined into your first dashboard.
In this lesson, you will explore a new dataset, with raw individual data, create charts to visualize quantative data distributions and the relation between categorical variables, and aggregate your individual data by different columns in the dataset.
By the end of this lesson,
First, log into the EpiGraphHub platform and go to the Datasets page.
There, type positive_cases_covid_d in the SEARCH field, and
click on the dataset with this name (the dataset highlighted below).
The following page will open:
In this dataset, we will work with individual data on positive cases
of COVID-19 in Colombia. To do some initial exploration, we will select
the RAW RECORDS QUERY MODE and the following columns of the
dataset:
Below is the meaning of these columns:
fecha_inicio_sintomas: It represents the date of the
first symptoms.fecha_reporte_web: It represents the date that the
record was available online.departamento_nom: It represents the location of the
case in Colombia.edad: It represents the age of the case.unidad_medida: It indicates if the age in the column
edad is represented by years (1), months (2), or days
(3).sexo: It represents the sex of the case.estado: It represents the disease severity.
Leve equals mild symptoms, Moderado equals
moderate symptoms, Grave equals severe symptoms, and
Fallecido represents the individuals who died of
covid-19.After selecting these columns and clicking on the
RUN QUERY button, that’s the result:
Now, let’s do some further exploration with these columns.
Remember that to see the result of a query as a table you need to set
VISUALISATION TYPE to Table.
In this section, let’s learn how to make simple operations that will help you to better understand your data.
To discover how many NULL VALUES there are in a column,
select the AGGREGATE QUERY MODE, and add the metric
COUNT(*) that will count the number of rows returned:
Also, in the FILTERS field, add your column of interest,
with the operator IS NULL.
For example, to discover the number of null values in the column
fecha_inicio_sintomas you must use the filter:
And that’s the result:
This column has 489’336 null values.
To get the total number of rows in the table (6.1 million), run this
query (using the metrics COUNT(*)) without applying any
filter.
To remove the NULL VALUES in a specific column, e.g.,
fecha_inicio_sintomas, for the remaining of the
exploration, go to the FILTERS section and apply the following
filter:
And click on the SAVE button.
To see the number of unique values in a column, stay on the
AGGREGATE QUERY MODE. For example, to see the unique values
in the sexo column and count the number of covid-19 cases
per sexo value, select this column in the GROUP BY field
and add COUNT(sexo) or COUNT(*) in the METRICS
field below.
You will then see these fields filled on your screen:
After clicking on the RUN QUERY button, this will be the
result:
In the region highlighted above, you can see the exact number of
F (females) and M (males) observations.
The histogram can be used to provide the distribution of our data.
Using the positive_cases_covid_d, let’s plot a histogram of
the age of dead individuals.
In this case, we will use the column edad (that refers
to the age of the individual), the column unidad_medida (to ensure that
the value in edad represents years), and the column estado
to filter by the Fallecido (dead individuals).
To do this, click on the Table value to change the
VISUALISATION TYPE:
In the window that will open, type histogram:
Click on the histogram and on the SELECT button.
In the Explorae Page, select the column edad in the
COLUMNS field and apply the two filters presented below:
So, your final query configuration is:
After clicking on the RUN QUERY button, this will be the
displayed result:
To change the number of bins in the histogram, go to the CUSTOMIZE tab:
and change the NO OF BINS field:
After setting NO OF BINS to 20, you will get the result
below:
If you hover the mouse over the bars, you get the information about
what range of the edad (age) the bar represents, what is
the exact count, cumulative, and percentile that this bar
represents:
The boxplots can be used to summarise the statistics of the values in a column.
Analogous to what was done in the previous section, using the
positive_cases_covid_d dataset, we will plot a boxplot of
the age of dead individuals by department.
In this case, we will use the columns:
edad that refers to the age of the individual,unidad_medida to ensure that the value in
edad represents years, andestado to filter by the Fallecido (dead
individuals).To do this, on the Exploration page, click on the Table
value to change the VISUALIZATION TYPE:
In the window that will open, type box plot:
Click on the Box Plot and on the SELECT button.
For this chart, the METRICS field is mandatory. As we are interested
in the age distribution, let’s start by setting the METRICS field to
AVG(edad) in the Query section.
And RUN QUERY. You should then get the following result,
with only one box plot representing all age values in our dataset:
By default, the temporal column defined in TIME COLUMN and the TIME GRAIN are used to compute the value in METRICS.
In the result above, it means that :
Day based
on the fecha_reporte_webcolumn, thenAVG(edad)) by day were
computed, and eventuallyTo convince yourself about the explanation above, change the operator in the METRICS field from AVG to SUM, and see how it impacts the plotted distribution.
Now, to see the distribution of the age of individuals, without prior
aggregation by time, we will fill the field DISTRIBUTE ACROSS with the
id_de_caso column, representing the unique case ID of each
record in our dataset . This will ensure that the box plot will use the
age of each individual patient.
As a consequence of using DISTRIBUTE ACROSS =
id_de_caso, selecting the AGGREGATE operator in METRICS to
be SUM or AVG will not change the result distribution. Let’s then keep
METRICS as AVG(edad).
To get the age in years only for the dead individuals, let’s apply the two filters presented below:
In the SERIES field, we will select the column with
unique values to be shown along the X axis. For each unique value in
this column, a box plot will be computed.
To get one box plot for each department in Colombia, let’s set SERIES
to the column departamento_nom. So, your final query
configuration is the following:
After clicking on the RUN QUERY button, this will be the
received result:
To see the name of the departamento_nom associated with
each box plot, we can rotate the X axis labels. To do this, go to the
CUSTOMISE tab:
and change the X TICK LAYOUT field to 90º:
and that’s the result:
If you hover the mouse over a box plot, you get the information about the quartiles, observation, and outliers:
You can change the time range considered in the region below:
And in the field below, you can change the type of box plot:
By default, it uses Tukey, where the min and max values
are at most 1.5 times the IQR (interquartile range) from the first
quartile (25 percentile) and third quartile (75 percentile),
respectively. The other available options are:
Min/max (no outliers);2/98 percentiles;9/91 percentiles.A pivot table is a good visualisation to see the relation of discrete
values in two different columns of a dataset. Let’s use this
visualisation to see the total number of cases related to the values in
the column estado that represent the severity of the
COVID-19 case and the column ubicacion that represents
where the covid-19 case was treated.
Below is the meaning of the values that will appear in the visualisation:
Casa: It means that the individual was treated at
home;Fallecido: It means that the individual died from
covid-19;Leve: It means that the individual had mild covid-19
symptoms;Moderado: It means that the individual had moderate
covid-19 symptoms;Grave: It means that the individual had severe covid-19
symptoms.To do a pivot table, click on the Table value to change
the VISUALIZATION TYPE:
In the window that will open, type Pivot table:
Click on the Pivot Table v2 and on the
SELECT button.
We need to select a column to be used as columns of the pivot table
and another to be used as rows of the pivot table. In this example,
let’s use ubicacion in the field COLUMNS and
estado in the field ROWS.
To see the number of cases grouped by the values in the columns and
rows of the pivot table let’s use as metrics COUNT(*), to
count all the records returned in this aggregation.
So, the final query is:
To avoid the columns with values not informed, let’s create 2 different filters. The first is to ignore the null values and the second is to ignore the values equal to “N/A” that represent the individuals that died but not by covid.
The <> represents the not equals
operator.
Then click on the RUN button, and the result will
be:
In the Options section selected below you have the
following possibilities:
If you select the first two items (SHOW ROWS TOTAL, and
SHOW COLUMNS TOTAL), that is the result:
Furthermore, it’s possible to change the aggregation function used to
compute the total values in the rows in the columns by changing the
value in the select box. The default value is Sum.
It’s also possible to transpose the table, by selecting the
TRANSPOSE PIVOT and put different metrics side by side (if
you add more than one metric in the METRICS field) by
selecting the COMBINE METRICS.
As mentioned before, the positive_cases_covid_d dataset
represents individual data. So, we will use the AGGREGATE
QUERY MODE to get the aggregate number of cases according to some
conditions such as date of first symptoms, sex, department, and disease
severity.
To do this, we will use the Time-series Bar Chart v2
VISUALISATION TYPE. When we select this visualisation, we have the
following fields to fill:
We need to select the columns that we want to aggregate in the GROUP
BY field. For the conditions listed above, we will add the columns
sexo and estado in the GROUP BY field.
We also need to select a TIME COLUMN, for this visualisation. So,
let’s use the fecha_inicio_sintomas to aggregate the data
by the date of the first symptoms. Remember that you can fill this field
with any other time column. Let’s also change the TIME GRAIN to
Week and select the data in the first quarter of 2022.
To count the total number of observations aggregated by the TIME
COLUMN and columns in the GROUP BY field let’s use the metrics
COUNT(*).
In the FILTERS field, let’s add a filter to ignore the values
N/A in the column estado, which represent the
individuals that died, but not as a consequence of covid-19.
So, the final query configuration is:
After running this query (clicking on the RUN QUERY
button), that is the result:
To improve the chart, let’s go to the CUSTOMIZE tab and select the
checkboxes STACK SERIES and SHOW LEGEND:
So, after changing the title of the plot to
COVID-19 cases in Colombia - Severity by sex over time, the
result is:
Note that with this configuration, the chart is much more informative. If you hover the mouse over a bar you can see the exact value of cases according to each group.
You also can visualise the result of this aggregation in a tabular
way by looking at the Data section:
Remember that you can download this data by clicking on the
.csv or .json icon above the chart.
In this example, we use the data available for all the departments in Colombia. Practice by :
departamento_nom,
e.g. departamento_nom = ‘BOGOTA’.sexo
(e.g. sexo = ‘M’) and estado
(e.g. estado IN (‘Moderado’,’Leve’)).Congratulations! You reached the end of this fourth lesson on advanced data exploration.
Hopefully, you are now feel comfortable with the exploration of datasets with raw individual data and their aggregation. By now, you should also know how to inspect the values of different columns in your dataset, and visualise the distribution of quantitative values and the relationship between categorical variables.
In the next lesson you will go further into dashboard creation and learn how to enrich your dashboards. See you there :)
The following team members contributed to this lesson:
This work is licensed under the Creative Commons Attribution Share Alike license.